Abstract 

Background: Genome-wide association studies (GWAS) have been used successfully in detecting associations 
between common genetic variants and complex diseases. However, common SNPs detected by current GWAS 
only explain a small proportion of heritable variability. With the development of next-generation sequencing 
technologies, researchers find more and more evidence to support the role played by rare variants in heritable 
variability. However, rare and common variants are often studied separately. The objective of this paper is to 
develop a robust strategy to analyze association between complex traits and genetic regions using both common 
and rare variants. 

Results: We propose a weighted selective collapsing strategy for both candidate gene studies and genome-wide 
association scans. The strategy considers genetic information from both common and rare variants, selectively 
collapses all variants in a given region by a forward selection procedure, and uses an adaptive weight to favor 
more likely causal rare variants. Under this strategy, two tests are proposed. One test, denoted by BwSC, is sensitive 
to the directions of genetic effects, and it separates the deleterious and protective effects into two components. 
The other, denoted by BwSCd, is robust to the directions of genetic effects, and it considers the difference of the two 
components. In our simulation studies, BwSC achieves a higher power when the causal variants have the same 
genetic effect, while BwSCd is as powerful as several existing tests when a mixed genetic effect exists. Both of the 
proposed tests work well with and without the existence of genetic effects from common variants. 

Conclusions: Two tests using a weighted selective collapsing strategy provide potentially powerful methods for 
association studies of sequencing data. The tests have a higher power when both common and rare variants 
contribute to the heritable variability and the effect of common variants is not strong enough to be detected by 
traditional methods. Our simulation studies have demonstrated a substantially higher power for both tests in all 
scenarios regardless of whether the common SNPs are associated with the trait or not. 



Background 

Genome-wide association studies (GWAS) have been 
used successfully in detecting associations between com- 
mon genetic variants and complex diseases. However, 
common SNPs detected by current GWAS only explain a 
small proportion of heritable variability [1]. These identified 
common SNPs usually have a relatively small to 
modest genetic effect, which suggests that another type 
of variant, the rare variant, needs to be considered in 
current GWAS. Recent studies have shown that common 
diseases can be caused by causal variants with a wide 
spectrum of allele frequencies including rare alleles [2-4]. 
In addition to the Common Diseases Common Variants 
(CDCV) hypothesis underlying complex-disease etiology, 
an alternative hypothesis, the Common Diseases Rare 
Variants (CDRV) hypothesis has been the topic of much 
recent debate [4]. Under this hypothesis, the analysis of the 
cumulative effect of rare variants may become crucial 
in discovering the link between a candidate gene and the 
heritable variability missed by traditional GWAS. 
There is increasing evidence to support this hypothesis. 
For example, rare variants associated with type I diabetes, 
hypertension, sterol absorption and plasma levels of LDL 
have been detected [5-9]. While some studies have shown 
that rare variants can increase the risk of disease, 
recent studies also indicate that they can play a protective 
role in complex traits. For example, multiple rare 
variants have been shown to act protectively against type 
I diabetes and hypertension [5,8,9]. With the development 
of next-generation sequencing technologies, more 
rare variants can be genotyped, so the analysis of association 
between rare variants and diseases becomes possible. 
The availability of the sequencing data offers a great 
opportunity to pursue a very powerful association study 
considering both common and rare variants. However, 
traditional GWAS are suited only to detecting common 
SNPs. Moreover, they lack power and require large sample 
sizes for detecting rare variants because of their extremely low 
allele frequencies. Hence, the development of more 
powerful statistical tests for association studies using 
both rare and common variants is needed to meet these 
challenges. 

Recently, a strategy that collapses all rare variants across 
a causal region was proposed [10]. The idea behind this 
strategy is to assume that each rare variant in a causal 
region contributes equally to a disease. Therefore, collap- 
sing genotypes across variants would result in enriched 
association signals and a reasonably high frequency allele. 
Several tests based on different collapsing strategies for 
case-control studies were proposed. One is the Cohort 
Allelic Sums Test (CAST) [10], in which the numbers of 
individuals with one or more mutations in a group (e.g. 
gene) are compared between cases and controls. While 
CAST only deals with rare variants, the Combined Multivariate 
and Collapsing (CMC) method [11] generalizes it by 
performing a multivariate test with common variants and 
collapsed scores of rare variants. A weighted sum statistic 
[12] is another method, which collapses both common 
and rare variants with different weights based on 
allele frequencies, assuming that rare variants have a larger 
effect than common ones. One such weighted sum test, 
named ORWSS, whose weights are calculated from 
odds ratios, was proposed recently by Feng, Elston and Zhu 
[13]. Using the regression approaches proposed by Morris 
& Zeggini [14], those methods can be extended to quantitative 
phenotypes. Besides the collapsing strategy, several 
multiple-marker tests have been proposed. Two tests, SSU 
and SSUw, based on the sum test have been proposed by Pan 
[15,16]; each can be applied to either common variants 
or rare variants, but not both. A new adaptive sum strategy 
proposed by Pan and Shen [17] provides a selective 
way to test regions with a few different combinations of 
genetic variants; it is computationally faster, but the 
result depends on the order of the variants. The logistic 
kernel-machine-based test by Wu [18], which is based on a 
logistic regression with a kernel function of multiple SNPs, 
allows for flexible modeling of epistatic and nonlinear SNP 
effects. The power of a single-marker test is usually low 
because each test uses limited genetic variant information and 
a multiple-testing correction is needed. Multiple-marker 
tests may also lose power because of their higher degrees of 
freedom. Collapsing methods can avoid the drawbacks of 
both single-marker and multiple-marker tests by considering 
all the genetic variant information with only one 
degree of freedom. 

However, collapsing methods have their own limita- 
tions and may not be robust. One limitation is that the 
classification of rare variants is subjective based on a cer- 
tain threshold. Tests considering only rare variants can- 
not utilize genetic information of common variants and 
lose some power as a consequence. Weighted sum statis- 
tics [12,13] were proposed to address this issue by using 
weights based on minor allele frequencies or log odds 
ratios. Another limitation is that collapsing methods can 
be seriously impaired by misclassification of collapsing 
regions [11]. Regions can usually be defined by genes, 
SNP allele frequencies, or variant causality. If all rare variants 
within a collapsing region have the same effect on a 
disease, for example a deleterious effect, the association 
signal can be amplified; however, collapsing many non-causal 
variants introduces noise and adversely 
affects power. To address this problem, several methods 
have been proposed recently [19-21]. An adaptive sum 
test has been proposed [19] to collapse SNPs in a region 
where their effects have different directions. Each SNP 
is collapsed positively or negatively based on its marginal 
association with the trait. Some feature-selection-based 
tests [20,21] have also been proposed for 
rare variants to extract the optimal subset for collapsing 
using greedy strategies such as forward selection 
and backward elimination. In this article, we develop 
a weighted selective collapsing method to detect both 
common and rare variants in a genetic region. We argue 
that common and rare variants may share a disease risk 
in the same region. The proposed strategy first selectively 
collapses common variants into two components representing 
the deleterious and protective effects by a forward 
selection procedure according to the correlations. 
Second, using each component as a base, the rare variants 
are selectively combined into the components with a 
data-driven weight. The final test statistics are developed 
through a logistic regression model for case-control 
studies. 
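
As a rough illustration of selective collapsing by forward selection, the sketch below greedily adds variants to a collapsed score for as long as the squared correlation between the score and the case-control status increases. It is a minimal sketch under assumptions: the function names `squared_corr` and `forward_select_collapse`, the unweighted sum score, and the stopping rule are illustrative choices, not the exact selection procedure or weights of the proposed tests.

```python
import numpy as np

def squared_corr(score, y):
    """Squared Pearson correlation; 0 if either vector is constant."""
    if np.std(score) == 0 or np.std(y) == 0:
        return 0.0
    return float(np.corrcoef(score, y)[0, 1] ** 2)

def forward_select_collapse(X, y):
    """Greedily pick columns of X whose sum best correlates with y.

    X: (n_individuals, n_variants) genotype matrix coded 0/1/2.
    y: (n_individuals,) case-control status coded 0/1.
    Returns the selected column indices and the collapsed score.
    """
    n, m = X.shape
    selected, score, best = [], np.zeros(n), 0.0
    remaining = set(range(m))
    while remaining:
        # Evaluate every remaining variant added to the current score.
        gains = {j: squared_corr(score + X[:, j], y) for j in remaining}
        j_best = max(gains, key=gains.get)
        if gains[j_best] <= best:
            break  # no candidate improves the correlation, so stop
        best = gains[j_best]
        score = score + X[:, j_best]
        selected.append(j_best)
        remaining.remove(j_best)
    return selected, score
```

Because the selection step is tuned to the observed phenotype, the resulting statistic is optimistic, which is why a permutation procedure is needed for the P-value.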

The proposed strategy tries to consider all information 
in a genetic region, including both common and rare 
variants. It addresses the genetic direction problem by 
using deleterious and protective components and over- 
comes the issue of non-causal variants by applying a for- 
ward selection procedure. To avoid selection bias, a 
permutation procedure is employed to find the P-value. 
The method is designed for candidate gene studies of 
qualitative traits, but it can also be used for genome-wide 
association scans by applying a sliding-window 
strategy, and for other types of traits through a 
generalized linear model. 
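
The permutation step can be sketched as follows. This is a generic permutation P-value, assuming a statistic for which larger values are more extreme; the function name `permutation_pvalue`, the default of 1,000 permutations, and the add-one correction are assumptions made for illustration, not details taken from the method.

```python
import numpy as np

def permutation_pvalue(stat_fn, X, y, n_perm=1000, seed=0):
    """Permutation P-value for a statistic where larger is more extreme.

    stat_fn(X, y) must rerun the whole procedure, including any variant
    selection, so that selection is repeated on every permuted data set
    and does not bias the null distribution.
    """
    rng = np.random.default_rng(seed)
    observed = stat_fn(X, y)
    exceed = 0
    for _ in range(n_perm):
        y_perm = rng.permutation(y)  # break the genotype-phenotype link
        if stat_fn(X, y_perm) >= observed:
            exceed += 1
    # Add-one correction keeps the estimate strictly positive.
    return (exceed + 1) / (n_perm + 1)
```

For a genome-wide scan with sliding windows, the same function would be called once per window, with `X` restricted to the variants in that window.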

Results 

Simulation studies 

In our simulation studies, we check the type-I error rate 
and compare the power of the weighted selective collapsing 
method (denoted as BwSC and BwSCd) with several 
other tests under various scenarios. The tests are classified 
into three categories based on genetic resources: rare 
variants only, common variants only, and both rare and 
common variants, denoted by R, C, and B, respectively. 
There are three traditional collapsing methods: the indicator, 
the sum, and the weighted sum, denoted by ind, 
sum, and wSum, respectively. For example, Rind represents 
the test considering only rare variants in a genetic 
region using an indicator function as the collapsing method 
for all rare variants without any selection. BwSum is the 
test using the weighted sum collapsing method combining all 
variants, where the weights are based on minor allele 
frequencies. The logistic-based single-marker test of a common 
SNP with Bonferroni correction is denoted by Cbon. 
The logistic-based multiple-marker test for common 
SNPs is denoted by Clogit. Let Bind and Bsum represent the 
logistic-based multiple-marker tests using all the common 
SNPs with an extra pseudo "common SNP", which is 
obtained by collapsing all rare variants through the indicator 
and the sum functions, respectively. The 
selective collapsing method is denoted by SC, and the 
weighted selective collapsing method is denoted by wSC. 
The tests that selectively collapse only rare variants are 
denoted by RSCind and RSCsum. Let BwOR be the odds-ratio-based 
weighted sum test. Bssu and Bssuw are the SSU and 
SSUw tests. BaSSU and BaSSUw are adaptive sum tests 
using SSU and SSUw as test statistics for all variants. 
BaSSUOrd and BaSSUwOrd are adaptive sum tests for ordered 
variants. Bkml is the logistic kernel-machine test. Our 
proposed tests are denoted by BwSC and BwSCd, which selectively 
collapse both common and rare SNPs according to 
the squared correlation coefficients and with data-driven 
weights. 
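
The three traditional collapsing methods can be sketched as below. This is illustrative only: the function names are hypothetical, and the weighted sum uses one commonly used frequency-based weighting (allele frequencies estimated in controls with add-one smoothing), which is an assumption and may differ in detail from the weights used in [12] or in this study.

```python
import numpy as np

def collapse_indicator(X_rare):
    """1 if an individual carries at least one minor allele, else 0."""
    return (X_rare.sum(axis=1) > 0).astype(int)

def collapse_sum(X_rare):
    """Total number of minor alleles carried across the variants."""
    return X_rare.sum(axis=1)

def collapse_weighted_sum(X, y):
    """Weighted sum with weights from control allele frequencies.

    Assumed weighting: w_j = sqrt(n * q_j * (1 - q_j)), with q_j the
    smoothed minor allele frequency in controls, so that rarer variants
    receive the larger weight 1 / w_j.
    """
    controls = X[y == 0]
    n_controls, n = controls.shape[0], X.shape[0]
    q = (controls.sum(axis=0) + 1) / (2 * n_controls + 2)
    w = np.sqrt(n * q * (1 - q))
    return X @ (1.0 / w)
```

Each function returns one collapsed value per individual, which can then enter a logistic regression as a single predictor with one degree of freedom.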

Simulated data are generated based on the strategies 
used in previous studies [17,22]. A target region with 
four observed common SNPs and an unobserved causal 
common SNP in the middle is simulated, while 20 
observed non-causal rare SNPs and 8 causal rare SNPs 
are also simulated independently of the common SNPs. 
For each sample, common SNPs are generated based on 
a latent variable $Z = (Z_1, \ldots, Z_5)'$ from a multivariate 
normal distribution with covariance structure 
$\mathrm{Corr}(Z_i, Z_j) = 0.4$ between any two observed components. Each 
observed common SNP has the same chance to correlate 
with the underlying causal SNP, with 
$\mathrm{Corr}(Z_i, Z_3) = a \times 0.4$, where $a$ takes the values 1 and -1 
with probability 0.5 each. Each allele on the haplotype is generated with a 
minor allele frequency obtained from a uniform distribution 
between 0.1 and 0.3. Rare variants, generated 
independently of the common SNPs, are also based on 
a multivariate normal distribution. Within each group of 
non-causal rare variants and causal rare variants, the LD 
structure is defined by $\mathrm{Corr}(Z_i, Z_j) = 0.4^{|i-j|}$. Each allele 
on a haplotype is generated with the cut-off of the 
minor allele frequency obtained from a uniform distribution 
between 0.001 and 0.005. Next, genotypes 
$X_i = (X_{i1}, \ldots, X_{i32})'$ for each individual are generated as the 
sum of two haplotypes. Last, the phenotype $Y_i$ is generated 
based on the logistic regression model with given 
odds ratios, and the order of the variants is 
shuffled. We consider five scenarios here. Scenario A is 
the null case where the odds ratios for all variants are 
set at 1. In Scenario B, rare variants are associated with 
the trait but common variants are not. We randomly 
select eight rare variants as causal, with odds ratios governed 
by a parameter OR between 1.3 and 3.1. The odds ratio of half of the 
causal rare variants is defined as OR and that of the other half is 
defined as OR plus one. For example, if OR is 2, then 
we consider Odds Ratio = (2, 2, 2, 2, 3, 3, 3, 3) for the eight 
causal rare variants. In Scenario C, both common and 
rare variants have effects on the trait, but the effects from 
common variants are not strong enough to be 
detected by traditional association approaches. The odds 
ratio of the unobserved causal common SNP is set at 
1.5. The odds ratios for rare variants are set in the same 
fashion as in Scenario B. Scenario D, which is otherwise 
similar to Scenario B, has a different odds ratio structure 
for rare variants: half of the causal rare variants are 
deleterious (odds ratio greater than 1), while the rest are 
protective (odds ratio less than 1). 
For example, if OR is 2, then we consider 
Odds Ratio = (2, 2, 1/2, 1/2, 3, 3, 1/3, 1/3) for the eight causal rare 
variants to reflect possible different directions of genetic effect. 
Scenario E is the counterpart of Scenario C, with 
odds ratios set to reflect possible different directions. 
In each scenario, 500 cases and 500 controls are simulated 
in each of 1,000 simulation replicates, and the significance level 
is set at 0.05. 
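
The data-generating steps for the rare variants can be sketched roughly as below: correlated latent normal variables are thresholded at the minor-allele-frequency quantile to give haplotypes, two haplotypes are summed into a genotype, and the phenotype is drawn from a logistic model. This is a simplified sketch under assumptions, not the simulation code of the study: the function names and the intercept `baseline_logit` are hypothetical, only the eight causal rare variants are generated, and the case-control sampling of 500 cases and 500 controls is omitted.

```python
import numpy as np
from statistics import NormalDist

def simulate_haplotypes(n, corr, mafs, rng):
    """Threshold correlated normal variables to get 0/1 minor alleles."""
    latent = rng.multivariate_normal(np.zeros(len(mafs)), corr, size=n)
    # Allele j is minor when its latent value falls below the MAF quantile.
    cutoffs = np.array([NormalDist().inv_cdf(p) for p in mafs])
    return (latent < cutoffs).astype(int)

def simulate_rare_genotypes(n, m, rng, rho=0.4, maf_range=(0.001, 0.005)):
    """Genotypes (0/1/2) for m rare variants with Corr = rho^|i-j|."""
    idx = np.arange(m)
    corr = rho ** np.abs(idx[:, None] - idx[None, :])
    mafs = rng.uniform(*maf_range, size=m)
    first = simulate_haplotypes(n, corr, mafs, rng)
    second = simulate_haplotypes(n, corr, mafs, rng)
    return first + second

def simulate_phenotype(X, odds_ratios, rng, baseline_logit=-3.0):
    """Draw 0/1 disease status from a logistic model.

    baseline_logit is a hypothetical intercept, not taken from the text.
    """
    logit = baseline_logit + X @ np.log(odds_ratios)
    prob = 1.0 / (1.0 + np.exp(-logit))
    return rng.binomial(1, prob)

rng = np.random.default_rng(1)
X = simulate_rare_genotypes(n=2000, m=8, rng=rng)
# Scenario B style effects for OR = 2: half at OR, half at OR + 1.
effects = np.array([2, 2, 2, 2, 3, 3, 3, 3], dtype=float)
y = simulate_phenotype(X, effects, rng)
```

For Scenario D style effects, `effects` would instead hold the mixed vector (2, 2, 1/2, 1/2, 3, 3, 1/3, 1/3).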

Type-I error rate and Power 

For tests requiring a permutation procedure, a quicker 
way of calculating P-values is to simulate a large sample 
of test statistics from the asymptotic null distribution. 
We randomly select 1,000 simulation replicates 
and shuffle the phenotype data 1,000 times to generate 
data under the null hypothesis and compute the test 
statistics for the asymptotic null distribution. We first 
consider Scenario A to check the type-I error rate. 
Table 1 shows that all tests have satisfactory type-I 
error rates. 
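
One way to read "satisfactory" here, offered as a rough guide rather than a criterion stated in the study: assuming each type-I error rate is estimated from N = 1,000 replicates at nominal level 0.05, its Monte Carlo standard error and an approximate two-standard-error band are

```latex
\mathrm{SE}(\hat{\alpha}) = \sqrt{\frac{\alpha (1 - \alpha)}{N}}
  = \sqrt{\frac{0.05 \times 0.95}{1000}} \approx 0.0069,
\qquad
0.05 \pm 2 \times 0.0069 \approx (0.036,\ 0.064)
```

so estimated rates inside roughly 0.036 to 0.064 are compatible with the nominal level.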

Under the alternative hypothesis, we first consider the 
case where all rare variants have the same genetic effect 
on the trait. In scenario B, where only rare variants are 
associated with the trait, we consider the R and B tests, a 
total of 17 tests. The results are shown in Table 2. The 
proposed test BwSC achieves the highest power across the 
values of OR considered. Roughly speaking, BwSC, BwSum, 
RSCsum, RSCind and BaSSUwOrd are the top five tests among the 17 tests. 
Multivariate tests with common variants and an extra 
component from rare variants, Bind and Bsum, have low 
power as expected, because common variants do not 
contribute to the trait variability, so they are just noise. 
Rsum has a consistently better performance than Rind. 
However, among all variants, more than half of them 
are non-causal, which are also noise in this case. 
Directly collapsing without any selection would lead to a 
loss of power. RSCind and RSCsum achieve a relatively higher 
power than Rind and Rsum by using a selection procedure to 
remove the noise from the non-causal rare variants. 
BwSum, on the other hand, puts more weight on the rare 
variants to reduce noise in this scenario, resulting in a better 
performance than the previous tests. However, as shown 
in the appendix, the weights based on the estimated 
minor allele frequencies from controls tend to favor 
deleterious rare variants and to ignore protective 
rare variants. Thus, scenario B, where all causal variants 
are deleterious, is the optimal case for BwSum. BwSC 
achieves the highest power by considering both common 
and rare variants with a selection procedure and a 
data-driven weight, which could benefit both deleterious 
and protective rare variants and reduce noise. BwOR has 
a lower power in this simulation study because, for a 
region with a limited number of variants, we used the 
weights from log odds ratios without an additional threshold. 
This may not be enough to distinguish 
the true signal from noise. In this simulation study, the 
order of all variants is shuffled to have a fair comparison 
with adaptive tests. BaSSUwOrd achieves a higher power 



Table 2 Power for all tests in simulated data of scenario B, no common SNPs effect, effects of RVs are in the same direction 

OR          1.3    1.6    1.9    2.2    2.5    2.8    3.1
Rind        0.227  0.376  0.522  0.63   0.737  0.81   0.851
Rsum        0.245  0.424  0.57   0.67   0.778  0.846  0.888
Bind        0.129  0.204  0.318  0.419  0.522  0.623  0.698
Bsum        0.147  0.243  0.343  0.47   0.565  0.674  0.751
RSCind      0.295  0.42   0.589  0.726  0.834  0.884  0.954
RSCsum      0.298  0.425  0.588  0.731  0.834  0.894  0.946
BwSum       0.302  0.474  0.631  0.71   0.81   0.875  0.931
BwOR        0.09   0.17   0.226  0.295  0.416  0.408  0.58
Bkml        0.044  0.054  0.057  0.067  0.08   0.074  0.078
Bssu        0.042  0.049  0.053  0.062  0.075  0.071  0.07
Bssuw       0.136  0.257  0.386  0.592  0.706  0.814  0.866
BaSSU       0.074  0.106  0.197  0.219  0.275  0.324  0.351
BaSSUw      0.161  0.243  0.378  0.504  0.691  0.755  0.823
BaSSUOrd    0.234  0.325  0.468  0.628  0.738  0.849  0.877
BaSSUwOrd   0.211  0.293  0.462  0.629  0.793  0.847  0.896
BwSCd       0.201  0.34   0.445  0.586  0.734  0.825  0.885
BwSC        0.316  0.509  0.654  0.775  0.892  0.927  0.97

There is a customized LD structure among common variants and among rare 
variants. Eight randomly selected rare variants are causal variants; the others 
are non-causal variants. The genetic effect parameter OR for the eight rare 
variants is listed in the table. If OR is 2, Odds Ratio = (2, 2, 2, 2, 3, 3, 3, 3) 
for the eight causal rare variants. Notations of tests are defined as in Table 1. 

by sorting the genotypes according to single test statistics 
and performing an adaptive SSUw test. BaSSUwOrd has 
a consistently better performance than Bssuw and BaSSUw 
in both cases. SSUw-based tests have a consistently better 
performance than SSU-based tests. 

When the effect of rare variants is relatively weak (OR 
is from 1.3 to 2.2), Rind and Rsum perform better than 
BaSSUwOrd. Bkml and Bssu have the lowest power in this 
simulation study. Bkml has a consistently better performance 
than Bssu. In scenario C, both common and rare 
variants are associated with the trait, but the association 
between common SNPs and the trait is not strong 



Table 1 Type I error rates for all tests in simulated data of scenario A 

Test Type-1 error Test Type-1 error Test Type-1 error Test Type-1 error 

^■,nd 0^054 R^ 005^ Bssu 0^053 B.ssuord 0.06 

f^sum 0^53 R^ 0054 B_ssu^ 0^042 Bgssu^ord 0-062 

Cbon 0^54 B_in^ 0055 B_assu 0^62 B_^ 0.042 

Ciogit O055 B_^ O058 Bgssuw 0^055 B_^ 0.051 

Bwsum 0.055 B^oR 0.062 Bkml 0.056 

There is a customized LD structure among common variants and among rare variants. Rind, collapsing method by indicator function on rare variants. Rsum, 
collapsing method by sum function on rare variants. RSCind, selective Rind. RSCsum, selective Rsum. Cbon, single-marker test with Bonferroni correction on common variants. 
Clogit, multivariate logistic regression test on common variants. Bind and Bsum, Clogit with a collapsed component from rare variants. BwSum, weighted sum test. BwOR, 
odds-ratio-based weighted sum test. Bssu, Bssuw, SSU-based tests. BaSSU, BaSSUw, adaptive sum tests. BaSSUOrd, BaSSUwOrd, ordered adaptive sum tests. Bkml, logistic 
kernel-machine test. BwSC, BwSCd, selectively weighted collapsing. 

enough to be detected by the traditional association 
methods. We considered all 19 tests; the results are 
shown in Table 3. Our test, BwSC, achieves the highest 
power for most values of OR, except when OR = 1.9, where 
BwSum has a slightly higher power. Roughly speaking, BwSC, 
BwSum, RSCsum, RSCind and BaSSUwOrd are the top five among the 
19 tests. Bkml and Bssu, either using a linear kernel or 
without using any weights on rare variants, result in 
roughly the same power as Cbon and Clogit in this simulation study. 
The results of selected tests in scenarios B and C, where 
all rare variants have the same genetic effect on the 
trait, are shown in Figure 1 to demonstrate the 
comparison. 

Now, we consider scenarios D and E, where rare variants 
have different genetic effects on the trait. Tables 4 
and 5 show the results of these two scenarios. BwSCd 
achieves the highest power in scenario D for most values 
of OR. When OR = 1.9 and 2.8, BaSSUwOrd achieves the 
highest power. When OR = 3.1, BaSSUOrd achieves the 
highest power. Roughly speaking, BwSCd, BaSSUwOrd, BaSSUOrd, 
BaSSUw and BwSC are the top five tests among the 17 
tests in scenario D. In scenario E, BwSCd and BaSSUwOrd 
achieve the highest power in most cases. When OR = 



Table 3 Power for all tests in simulated data of scenario C, weak common SNPs effect, effects of RVs are in the same direction 

OR          1.3    1.6    1.9    2.2    2.5    2.8    3.1
Rind        0.237  0.394  0.472  0.6    0.715  0.785  0.843
Rsum        0.247  0.418  0.543  0.636  0.747  0.811  0.869
Cbon        0.163  0.157  0.144  0.164  0.174  0.191  0.193
Clogit      0.195  0.199  0.193  0.207  0.212  0.228  0.238
Bind        0.278  0.364  0.436  0.517  0.618  0.677  0.76
Bsum        0.298  0.384  0.461  0.562  0.668  0.735  0.795
RSCind      0.236  0.43   0.565  0.702  0.781  0.888  0.91
RSCsum      0.238  0.446  0.605  0.705  0.815  0.892  0.92
BwSum       0.341  0.534  0.658  0.703  0.846  0.87   0.911
BwOR        0.253  0.312  0.344  0.475  0.456  0.582  0.648
Bkml        0.167  0.186  0.186  0.19   0.204  0.199  0.2
Bssu        0.165  0.179  0.179  0.181  0.192  0.192  0.188
Bssuw       0.203  0.334  0.458  0.61   0.716  0.808  0.861
BaSSU       0.168  0.215  0.235  0.28   0.303  0.34   0.383
BaSSUw      0.181  0.346  0.399  0.546  0.64   0.755  0.819
BaSSUOrd    0.163  0.293  0.376  0.571  0.592  0.733  0.798
BaSSUwOrd   0.238  0.367  0.506  0.663  0.732  0.847  0.89
BwSCd       0.21   0.395  0.484  0.625  0.661  0.822  0.848
BwSC        0.344  0.538  0.631  0.778  0.85   0.935  0.954

There is a customized LD structure among common variants and among rare 
variants. The OR for the underlying common SNP is 1.5. The genetic effect 
parameter OR for the eight rare variants is listed in the table. If OR is 2, 
Odds Ratio = (2, 2, 2, 2, 3, 3, 3, 3) for the eight causal rare variants. 
Notations of tests are defined as in Table 1. 



1.3 and 1.6, BwOR achieves the highest power. When 
OR = 1.3, 2.8 and 3.1, BaSSUwOrd has a higher power 
than BwSCd. When OR = 1.6, 1.9 and 2.5, BwSCd has a 
higher power. When OR = 2.2, they both achieve the 
same power. Roughly speaking, BwSCd, BaSSUwOrd, BwSC, 
BaSSUw and Bssuw are the top five tests among the 19 tests. 
In contrast to the results of scenarios B and C, 
the power of BwSum drops significantly, because the 
weights in BwSum only favor deleterious rare variants 
and ignore protective rare variants, which are 
as important as deleterious ones in this simulation. 
Although BwOR achieves a low power because of the limited 
number of variants, it has performed consistently 
better than BwSum in most cases under both scenarios. 
Due to the presence of causal rare variants with 
opposite association directions and of non-causal rare variants, 
other tests involving directly collapsing methods 
also have a lower power. On the other hand, SSU- and 
SSUw-based tests tend to perform well under these scenarios. 
BaSSUwOrd becomes one of the most powerful tests 
in these two scenarios. We find that SSUw-based tests 
combine both deleterious and protective genetic variations 
into the test statistic SSUw, while most collapsing 
methods only consider one of them. Sharing this 
merit of BaSSUwOrd, our second proposed method BwSCd, 
which is based on the difference of the two components, 
achieves the higher power in most cases. The results of 
selected tests in scenarios D and E, where rare variants 
have different genetic effects on the trait, are shown in 
Figure 2. 

Discussion 

In this paper, we proposed two novel association tests 
for candidate gene studies and genome-wide association 
studies. The test BwSC selectively collapses common and 
rare variants into two separate components with data-driven 
weights. The test statistic is derived by comparing 
these components, which is robust in situations with 
or without effects from common variants. A permutation procedure 
is employed to find the P-value. Simulation studies 
show that the proposed tests achieve a higher power 
than other commonly used tests for rare variants in 
most cases. The optimal scenario for the proposed test 
is when common and rare variants both contribute 
to the heritable variability and the effects of common 
variants are not detectable by traditional methods using 
common variants alone. If there is no association 
between the common variants and the trait, the proposed 
method also performs robustly, as demonstrated 
by our simulation studies. We believe that the 
improved power comes from three sources. First, the 
test considers more genetic information by combining 
both common and rare variants instead of dealing with 
rare variants alone. Second, the test filters out the 

Figure 1 Power comparison in scenarios B and C. Selected tests are considered for power comparison in scenarios B and C, where all causal 
rare variants have the same genetic effect on the trait. Scenario B is the case in which only rare variants affect the trait, while Scenario C is the case 
in which both common and rare variants affect the trait. The horizontal axis of both panels is the odds ratio. RSCsum represents selective Rsum. 
RSCind represents selective Rind. BwSum represents the weighted sum test. BaSSUwOrd represents the ordered adaptive sum test with SSUw as the 
test statistic. BwSC represents the weighted selective collapsing test sensitive to the direction. 



suspected non-causal variants as noise and separates the 
variants into deleterious ones and protective ones by the 
selective collapsing method. Distinguishing deleterious 
and protective sources can improve the power when 
variants have different genetic effects on the trait. For 
example, in the worst-case scenario, common variants 
have a deleterious effect, while rare variants collectively 
have a protective effect on the trait. The effects from 
the two sources will be neutralized if the effect directions 
are not distinguished. Our test can achieve a high 
power by choosing the strongest source in any case 
instead of neutralizing them. The third source of 
improved power is the data-driven 
weights. Instead of using weights based on estimates of 
the minor allele frequencies from control data, which 
favor deleterious rare variants and ignore protective 
rare variants, the proposed test uses weights 
based on an estimate of the disease risk, that is, the 
probability of disease for an individual carrying the mutation. The 
proposed weights tend to favor both deleterious and 
protective rare variants. 

Although the proposed test (BwSC) has many advantages, 
it is certainly not universally better than other 
tests. For example, in scenarios D and E, where 
mixed genetic effects exist, BwSC can only capture the 
genetic effect in one direction. It is suited to detecting 
variants with the same direction of genetic effect. 
Therefore, we also propose another test, BwSCd, which 
can capture genetic effects in both directions. It can be used for 
detecting a region of variants with opposite directions of 
genetic effects. We would also like to point out that the 
proposed tests can be easily extended to include covariates, 
since they are based on a logistic regression 
model. They can also be applied to quantitative traits by 
using a linear regression model. The strategy of 
collapsing rare variants based on common variants for 

Table 4 Power for all tests in simulated data of scenario D, no common SNPs effect, effects of RVs are in different directions 

OR          1.3    1.6    1.9    2.2    2.5    2.8    3.1
Rind        0.062  0.058  0.089  0.095  0.118  0.129  0.164
Rsum        0.054  0.062  0.092  0.083  0.113  0.118  0.158
Bind        0.062  0.06   0.059  0.074  0.085  0.1    0.128
Bsum        0.062  0.059  0.065  0.073  0.09   0.101  0.117
RSCind      0.09   0.15   0.214  0.221  0.314  0.352  0.395
RSCsum      0.094  0.151  0.202  0.21   0.335  0.353  0.449
BwSum       0.107  0.096  0.096  0.136  0.179  0.221  0.27
BwOR        0.09   0.126  0.133  0.165  0.211  0.222  0.255
Bkml        0.061  0.055  0.054  0.054  0.067  0.067  0.072
Bssu        0.056  0.053  0.052  0.05   0.062  0.062  0.068
Bssuw       0.095  0.126  0.181  0.254  0.314  0.354  0.478
BaSSU       0.086  0.087  0.13   0.138  0.167  0.162  0.229
BaSSUw      0.114  0.145  0.198  0.271  0.311  0.373  0.456
BaSSUOrd    0.113  0.175  0.241  0.289  0.39   0.409  0.566
BaSSUwOrd   0.129  0.2    0.256  0.321  0.385  0.468  0.543
BwSC        0.135  0.148  0.2    0.227  0.297  0.373  0.465
BwSCd       0.134  0.197  0.25   0.34   0.391  0.441  0.558

There is a customized LD structure among common variants and among rare 
variants. Eight randomly selected rare variants are causal variants; the others 
are non-causal variants. The genetic effect parameter OR for the eight rare 
variants is listed in the table. Odds ratios for half of the causal rare 
variants are in the opposite direction. If OR is 2, 
Odds Ratio = (2, 2, 1/2, 1/2, 3, 3, 1/3, 1/3) for the eight causal rare 
variants. Notations of tests are defined as in Table 1. 



qualitative traits in GWAS has been successfully applied 
to the simulated sequencing data from Genetic Analysis 
Workshop 17 [23], where a GWAS permutation procedure 
for our method was also proposed for qualitative 
traits. 

Conclusions 

In summary, we proposed two weighted selective collapsing tests for both
candidate gene studies and genome-wide association studies; in the latter
case, the analysis unit can be based on genes, pathways, or sliding windows.
The two tests are potentially powerful methods for association studies in
sequencing data because they combine information from all variants, filter
out suspected non-causal variants, and place an adaptive weight on likely
causal rare variants. One test is robust to the directions of genetic effects,
and it adapts to regions with mixed genetic effects. The other test is
sensitive to the directions of genetic effects, and it adapts to regions where
the genetic effects share the same direction. It is designed mainly for
detecting rare variants, and it achieves a higher power by considering common
variants when needed. Our simulation studies have demonstrated their
substantially higher power in all scenarios by combining advantages from
other existing tests.

Table 5 Power for all tests in simulated data of scenario E: weak common SNP
effect; effects of RVs are in different directions

OR                1.3    1.6    1.9    2.2    2.5    2.8    3.1
^ind              0.045  0.077  0.068  0.103  0.115  0.12   0.157
^sum              0.054  0.074  0.062  0.091  0.109  0.126  0.154
^bon              0.156  0.131  0.155  0.139  0.186  0.149  0.146
^logit            0.211  0.185  0.214  0.192  0.221  0.211  0.19
Bind              0.2    0.184  0.2    0.198  0.244  0.225  0.233
Bsum              0.19   0.182  0.2    0.197  0.243  0.226  0.229
nSC ind           0.068  0.122  0.176  0.241  0.27   0.359  0.387
(unlabeled)       0.094  0.119  0.193  0.254  0.273  0.371  0.39
BwSum             0.1    0.114  0.164  0.172  0.193  0.236  0.272
BwOR              ?      ?      ?      ?      ?      ?      ?
Bkml              0.169  0.167  0.186  0.159  0.197  0.173  0.171
Bssu              0.166  0.161  0.175  0.153  0.189  0.167  0.163
Bssuw             0.146  0.169  0.241  0.306  0.395  0.445  0.521
BaSSU             0.139  0.148  0.196  0.185  0.212  0.256  0.255
BaSSUw            0.183  0.196  0.233  0.302  0.354  0.459  0.476
BaSSUOrd          0.127  0.164  0.213  0.276  0.334  0.45   0.5
BaSSUwOrd         0.224  0.206  0.293  0.386  0.449  0.571  0.593
BwSC              0.133  0.182  0.256  0.332  0.357  0.479  0.48
BwSCd             0.19   0.217  0.308  0.386  0.468  0.568  0.548

There is a customized LD structure among common variants and among rare
variants. The OR for the underlying common SNP is 1.5. Eight randomly selected
rare variants are causal; the others are non-causal. The genetic effect
parameter OR for the eight causal rare variants is listed in the table. The
odds ratios for half of the causal rare variants are in the opposite
direction. If OR is 2, odds ratio = (2, 2, 1/2, 1/2, 3, 3, 1/3, 1/3) for the
eight causal rare variants. Notations of tests are defined similarly to those
in Table 1.

Method 

We focus on qualitative traits only in this study. The approach can be easily
extended to other traits through a generalized linear model. Different
variants and collapsing strategies are considered within the framework of
logistic regression. We also compared some recently proposed methods, SSU
tests [15], adaptive tests [17], ORWSS [13] and the Logistic Kernel-Machine
Test [18], in our simulation study. The goal of this work is to detect any
association between the trait and a given genetic region which includes both
common and rare variants. Consider an association study with N samples in a
genetic region with K variants. Let $Y_i$ denote the coded trait for the ith
sample, 0 for controls and 1 for cases. The variants were coded by an additive
genetic model: $X_{ik}$ was coded as 0, 1, or 2 as the genotype score for the
kth marker of the ith sample, where $i = 1, \dots, N$ and $k = 1, \dots, K$.
Let $X^c_{ij}$ and $X^r_{ik}$ be common variants and rare variants based on a
certain threshold. For example, SNPs with minor allele frequencies less than
0.01 are considered as rare variants.

[Figure 2: two panels, "Power comparison in scenario D" and "Power comparison
in scenario E"; x-axis: Odds Ratio (1.5 to 3.0); curves: BwSCD, BaSSUwOrd,
BaSSUOrd, BaSSUw, BwSC]

Figure 2 Power comparison in scenarios D and E. Selected tests are considered
for power comparison in scenarios D and E, where causal rare variants have
different genetic effects on the trait. Scenario D is the case where only rare
variants affect the trait, while Scenario E is the case where both common and
rare variants affect the trait. BwSCd represents the weighted selectively
collapsing test robust in the direction. BaSSUOrd represents the ordered
adaptive sum test with test statistics of SSU. BaSSUw represents the adaptive
sum test with test statistics of SSUw. Other names of tests are defined
similarly as in Figure 1.
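The additive coding and the rare/common split described above can be sketched
as follows. This is a minimal illustration only: it assumes the genotypes are
held in an N x K NumPy array of minor-allele counts, and the function name
`split_by_maf` and the use of the pooled sample to estimate the minor allele
frequency are assumptions made for the example, not details given in the text.

```python
import numpy as np

def split_by_maf(genotypes, threshold=0.01):
    """Split an N x K matrix of 0/1/2 genotype scores into common and rare variants.

    The frequency is estimated from all samples pooled (an assumption of this sketch).
    """
    genotypes = np.asarray(genotypes, dtype=float)
    allele_freq = genotypes.mean(axis=0) / 2.0          # two alleles per sample
    maf = np.minimum(allele_freq, 1.0 - allele_freq)    # fold to the minor allele
    rare_mask = maf < threshold
    return genotypes[:, ~rare_mask], genotypes[:, rare_mask], maf
```

With the threshold of 0.01 used in the text, a variant carried once among 100
samples (frequency 0.005) is classified as rare.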

Collapsing Methods and Logistic Regression 

Collapsing approaches have been previously proposed using either an indicator
function or a sum (proportion) function [11,14]. Let $S_i$ denote the
collapsed score of the ith sample for a genetic region, and let $X^r_{ik}$ be
the genotype score of the ith sample at the kth rare variant. The indicator
function based collapsing method is

$$S_i = I\left(\sum_{k=1}^{K} X^r_{ik} > 0\right),$$

and the sum (proportion) function based collapsing method is

$$S_i = \sum_{k=1}^{K} X^r_{ik}.$$

In a case control study, it is natural to consider the logistic regression
model for tests, and those collapsing methods can be achieved by

$$\text{logit} \Pr(Y_i = 1) = \beta_0 + \beta_1 S_i.$$

The null hypothesis of no genetic effect is $H_0 : \beta_1 = 0$. In a
candidate gene study, we employed the likelihood ratio test. Because the score
test is computationally faster than the likelihood ratio test, we use the
following tests for the genome wide association study. Let

$$U = \sum_{i=1}^{N} S_i (Y_i - \bar{Y})$$

and

$$V = \bar{Y}(1 - \bar{Y}) \sum_{i=1}^{N} (S_i - \bar{S})^2,$$

where $\bar{Y} = \sum_{i=1}^{N} Y_i / N$ and $\bar{S} = \sum_{i=1}^{N} S_i / N$.
The score test is

$$T = \frac{U^2}{V},$$

which has an asymptotic $\chi^2$ distribution with one degree of freedom.
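The two collapsing functions and the score test can be sketched in a few
lines. The sketch below assumes the rare-variant genotypes are an N x K NumPy
array and the trait is a 0/1 vector; the function names are illustrative, and
the form $U = \sum_i S_i (Y_i - \bar{Y})$ is the standard score for this
logistic model rather than a formula recovered verbatim from the text. The
p-value uses the identity $\Pr(\chi^2_1 > t) = \text{erfc}(\sqrt{t/2})$.

```python
import math
import numpy as np

def collapse_indicator(x_rare):
    """S_i = 1 if sample i carries at least one rare allele in the region."""
    return (np.asarray(x_rare).sum(axis=1) > 0).astype(float)

def collapse_sum(x_rare):
    """S_i = number of rare alleles carried by sample i in the region."""
    return np.asarray(x_rare, dtype=float).sum(axis=1)

def score_test(y, s):
    """Score statistic U^2 / V and its chi-square (1 df) p-value."""
    y = np.asarray(y, dtype=float)
    s = np.asarray(s, dtype=float)
    y_bar = y.mean()
    u = np.sum(s * (y - y_bar))
    v = y_bar * (1.0 - y_bar) * np.sum((s - s.mean()) ** 2)
    stat = u * u / v
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value
```

For a single collapsed score the statistic equals N times the squared sample
correlation between the score and the trait, which is why the squared
correlation can serve as a fast screening quantity in selection procedures.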

The limitation of the current collapsing approaches is that they only consider
rare variants. For example, when common variants contribute to heritable
variability that is not detectable by the traditional common SNP approaches,
ignoring them reduces the power of the tests.

The Combined Multivariate Collapsing method (CMC) [11] solves this problem by
regarding the collapsed score as a common SNP and performing a Hotelling's
test on multiple markers. To put this method within our logistic regression
framework, we consider a multivariate logistic regression model,

$$\text{logit} \Pr(Y_i = 1) = \beta_0 + \beta_1 S_i + \sum_{j=1}^{J} \beta_{j+1} X^c_{ij},$$

where $S_i$ is the collapsed score of the rare variants and $X^c_{ij}$ is the
genotype score of the ith sample at the jth of J common variants. The null
hypothesis of no genetic effect is
$H_0 : \beta_1 = \beta_2 = \dots = \beta_{J+1} = 0$.

Another collapsing method uses a data-driven weight considering both common
and rare variants,

$$S_i = \sum_{k=1}^{K} w_k X_{ik},$$

where the weight is calculated by

$$w_k = \frac{1}{\sqrt{q_k (1 - q_k)}}, \qquad q_k = \frac{\sum_{i \in \text{control}} X_{ik} + 1}{2 N_0 + 2},$$

and $N_0$ is the number of controls in the study [12]. By using a weight, the
collapsed score amplifies the contribution of rare variants. The test
statistic can be derived from logistic regression as before. Because the
weights are data-dependent, a permutation test is employed to find P-values.
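A permutation test for a data-dependent weight has to recompute the weight on
every permuted data set. The sketch below illustrates this under stated
assumptions: the weight is taken to be the inverse square root of
$q(1 - q)$ with $q$ the add-one frequency estimate in controls, the test
statistic is the score statistic of the collapsed score (the text only says it
is derived from logistic regression), and all function names are hypothetical.

```python
import numpy as np

def control_based_weights(x, y):
    """Weight each variant by 1 / sqrt(q (1 - q)), q estimated in controls."""
    controls = x[y == 0]
    n_controls = controls.shape[0]
    q = (controls.sum(axis=0) + 1.0) / (2.0 * n_controls + 2.0)
    return 1.0 / np.sqrt(q * (1.0 - q))

def weighted_sum_statistic(x, y):
    """Score statistic of the weighted collapsed score (weights use y)."""
    s = x @ control_based_weights(x, y)
    y_bar = y.mean()
    u = np.sum(s * (y - y_bar))
    v = y_bar * (1.0 - y_bar) * np.sum((s - s.mean()) ** 2)
    return u * u / v if v > 0 else 0.0

def permutation_p_value(statistic, x, y, n_perm=1000, seed=0):
    """Permute the trait labels, recomputing weights and statistic each time."""
    rng = np.random.default_rng(seed)
    observed = statistic(x, y)
    exceed = sum(statistic(x, rng.permutation(y)) >= observed
                 for _ in range(n_perm))
    return (exceed + 1) / (n_perm + 1)
```

The add-one form of the returned p-value keeps it strictly positive, a common
convention for permutation tests; it is a choice of this sketch.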

For a region with both common and rare variants, the above two approaches
consider all the genetic information. However, it is unlikely that all
variants in this region contribute to the heritable variability; it is more
likely that only some of them are causal. If many of the rare variants are
non-causal, collapsing will inevitably introduce noise and reduce the power of
the test.

A covering method called RareCover [21] has recently been proposed to
determine a collapsing subset from all the variants in this region using a
forward selection procedure. For the purpose of comparison, we also put this
strategy in our logistic regression framework. Instead of using Pearson's
$\chi^2$, which was used by the original authors, we considered the squared
correlation coefficient $R^2$ as the screening test statistic. Starting from a
score without any rare variants, each rare variant is examined, and the
variant that improves the test statistic the most is added into the score. An
optimal subset was obtained by this forward selection procedure to achieve the
highest squared correlation between the collapsed score and the trait. The
test statistic can then be derived from a logistic regression model between
the trait and the collapsed score as before. The P-value can be found by
permutation. However, this method does not consider genetic information from
the common variants in this region, and it ignores the direction of the rare
variants by using either the squared correlation coefficient or Pearson's
$\chi^2$.
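The forward selection described above can be sketched as a greedy loop. This
is an illustration in the logistic-regression framing of this section, not the
original RareCover implementation: it assumes sum collapsing, treats the
squared correlation of an empty score as zero, and uses hypothetical function
names.

```python
import numpy as np

def squared_corr(score, y):
    """Squared sample correlation; defined as 0 when either vector is constant."""
    a = score - score.mean()
    b = y - y.mean()
    denom = (a @ a) * (b @ b)
    return 0.0 if denom == 0 else float((a @ b) ** 2 / denom)

def forward_select(x_rare, y):
    """Greedily add the rare variant that most improves R^2 with the trait."""
    n_samples, n_variants = x_rare.shape
    score = np.zeros(n_samples)
    selected, best = [], 0.0
    remaining = list(range(n_variants))
    while remaining:
        r2 = [squared_corr(score + x_rare[:, j], y) for j in remaining]
        top = int(np.argmax(r2))
        if r2[top] <= best:          # no remaining variant improves the score
            break
        best = r2[top]
        j = remaining.pop(top)
        score = score + x_rare[:, j]
        selected.append(j)
    return selected, score
```

Because the criterion is squared, a protective variant can raise it just as a
deleterious one can, which illustrates the loss of direction noted above.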

Recently proposed multi-marker tests

We also compared some recently proposed methods, SSU tests [15], adaptive
tests [17], ORWSS [13] and the Logistic Kernel-Machine Test [18], in our
simulation studies. We briefly review these methods here. The SSU and SSUw
tests are defined as follows.

Let the score vector be $U = (U_1, \dots, U_K)'$, where each component is
$U_k = \sum_{i=1}^{N} X_{ik} (Y_i - \bar{Y})$ and $\bar{Y}$ is the sample mean
of the phenotype. Then

$$SSU = U'U \quad \text{and} \quad SSUw = U' \, \text{Diag}(I_f)^{-1} U,$$

where $I_f = \text{Cov}(U)$ is the expected Fisher information matrix.
Asymptotic distributions of the above two test statistics are scaled $\chi^2$
distributions [15].
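A short sketch of the two statistics follows. It assumes an N x K genotype
array and a 0/1 trait vector, and it estimates $\text{Cov}(U)$ under the null
as $\bar{Y}(1 - \bar{Y}) \sum_i (X_i - \bar{X})(X_i - \bar{X})'$; that
estimator is the usual one for a score vector in a logistic model without
covariates and is an assumption here, since the text does not spell it out.
Variants with no variation in the sample would need to be dropped first.

```python
import numpy as np

def ssu_statistics(x, y):
    """Return (SSU, SSUw) for genotype matrix x (N x K) and 0/1 trait y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    y_bar = y.mean()
    u = x.T @ (y - y_bar)                           # score vector U
    centered = x - x.mean(axis=0)
    cov_u = y_bar * (1.0 - y_bar) * (centered.T @ centered)
    ssu = float(u @ u)
    ssuw = float(np.sum(u ** 2 / np.diag(cov_u)))   # U' Diag(Cov(U))^{-1} U
    return ssu, ssuw
```

With a single variant, SSUw reduces to the one-degree-of-freedom score
statistic $U^2 / V$.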

For the Adaptive test, suppose that $U_m = (U_1, \dots, U_m)'$, where
$m \le K$, is the vector containing the first m components of the score
vector. The adaptive test statistic is

$$aT = \min_{1 \le m \le K} \text{Pval}(T(U_m)),$$

where $\text{Pval}(T(U_m))$ is the p-value of the test statistic T. For the
Adaptive test, we used SSU and SSUw as the test statistic T. The resulting
adaptive tests are called the aSSU and aSSUw tests. More generally, one can
order the SNPs based on the single-SNP test statistics and repeat the adaptive
test process, resulting in the aSSU-Ord and aSSUw-Ord tests. The P-value of aT
is calculated by a permutation procedure.
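One way to organize the permutation procedure is to reuse a single set of
permutations both for the p-value of each prefix statistic and for the p-value
of their minimum. The sketch below does this for aSSU; it is an illustration
under that assumption (the text does not describe how the permutations are
organized), the function name is hypothetical, and each permuted data set is
counted in its own reference distribution, which is a simplification.

```python
import numpy as np

def adaptive_ssu_p_value(x, y, n_perm=199, seed=0):
    """Permutation p-value of aT = min over m of Pval(SSU on the first m SNPs)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    def prefix_ssu(trait):
        u = x.T @ (trait - trait.mean())
        return np.cumsum(u ** 2)     # SSU of U_m for m = 1, ..., K

    # Row 0 is the observed data; rows 1..n_perm are permuted data sets.
    stats = np.vstack([prefix_ssu(y)] +
                      [prefix_ssu(rng.permutation(y)) for _ in range(n_perm)])
    perm = stats[1:]
    # p-value of every (row, m) statistic against the permutation rows.
    exceed = (perm[None, :, :] >= stats[:, None, :]).sum(axis=1)
    p_values = (exceed + 1) / (n_perm + 1)
    min_p = p_values.min(axis=1)     # aT for the observed and permuted rows
    return float((np.sum(min_p[1:] <= min_p[0]) + 1) / (n_perm + 1))
```

To obtain the ordered variants (aSSU-Ord), the columns of `x` would be
reordered by their single-SNP statistics inside each permutation before the
cumulative sum is taken.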

For the ORWSS test, the score is constructed in the same way as in other
weighted sum tests,

$$S_i = \sum_{k=1}^{K} w_k X_{ik},$$

but the weight is calculated as follows. The amended estimator of the odds
ratio is computed by adding 0.5 to each cell of the 2 by 2 table for case
control studies. Define $\gamma_k = \log(OR_k)$, where $OR_k$ is the odds
ratio for the kth marker. Then

$$w_k = \begin{cases} \gamma_k & \text{if } |\gamma_k - \bar{\gamma}| > c\sigma \\ 0 & \text{otherwise} \end{cases}$$

where $\sigma$ is the standard deviation calculated from $\gamma_k$,
$k = 1, \dots, K$, c is a parameter, and $\bar{\gamma}$ is the mean of the log
odds ratios [13]. In the simulation study, because the number of variants is
small, we used the logarithm of the odds ratio directly as the weight for each
SNP, without classification. The test statistic is then defined as

$$ORWSS = \sum_{i \in \text{Case}} \text{rank}(S_i).$$

The P-value of ORWSS is calculated by a permutation procedure.
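The unthresholded version used in the simulation study can be sketched as
below. Two details are assumptions of this sketch rather than statements from
the text: the 2 by 2 table for each marker is taken to cross carrier status
(genotype score greater than 0) with case/control status, and tied scores
receive their average rank. The function names are hypothetical.

```python
import numpy as np

def log_odds_ratio_weights(x, y):
    """Amended log odds ratio per marker: 0.5 is added to each table cell."""
    carrier = x > 0
    case = (y == 1)[:, None]
    a = np.sum(carrier & case, axis=0) + 0.5      # carrier cases
    b = np.sum(~carrier & case, axis=0) + 0.5     # non-carrier cases
    c = np.sum(carrier & ~case, axis=0) + 0.5     # carrier controls
    d = np.sum(~carrier & ~case, axis=0) + 0.5    # non-carrier controls
    return np.log(a * d / (b * c))

def average_ranks(values):
    """Ranks starting at 1, with ties replaced by their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    _, inverse = np.unique(values, return_inverse=True)
    inverse = inverse.ravel()
    tie_mean = np.bincount(inverse, weights=ranks) / np.bincount(inverse)
    return tie_mean[inverse]

def orwss_statistic(x, y):
    """Sum of the ranks of the weighted scores among cases."""
    scores = x @ log_odds_ratio_weights(x, y)
    return float(average_ranks(scores)[y == 1].sum())
```

As with the other data-dependent weights, a permutation p-value would
recompute the weights for each permuted trait vector.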

For the Logistic Kernel-Machine Test, the test statistic is based on logistic
regression with a kernel function of the SNPs,

$$\text{logit} \Pr(Y_i = 1) = \beta_0 + h(X_{i1}, \dots, X_{iK}).$$

Some commonly used kernels include the linear, identity-by-state (IBS) and
quadratic kernels. We only consider the linear kernel here. In order to test
whether there is a true genetic effect, the null hypothesis is
$H_0 : h(X) = 0$. The test statistic has been developed as

$$Q = (Y - \hat{Y})' \mathbf{K} (Y - \hat{Y}),$$

where $\hat{Y}$ is the vector of fitted values under the null model and
$\mathbf{K}$ is the N x N kernel matrix. Q follows a scaled $\chi^2$
distribution [18].
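For the linear kernel and a null model with an intercept only, the statistic
is easy to compute directly, because the null fitted value for every sample is
the case fraction. The sketch below makes those two assumptions explicit
(no covariates, kernel matrix equal to the genotype matrix times its
transpose) and uses a hypothetical function name.

```python
import numpy as np

def linear_kernel_statistic(x, y):
    """Q = (Y - Yhat)' K (Y - Yhat) with K = X X' and an intercept-only null."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    residual = y - y.mean()        # null fitted value is the case fraction
    kernel = x @ x.T               # linear kernel matrix, N x N
    return float(residual @ kernel @ residual)
```

Under these assumptions Q equals $U'U$ with $U = X'(Y - \bar{Y})$, so it
coincides with the SSU statistic.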

For all the tests above, we considered both common 
and rare variants, since we want to develop a robust 
strategy to detect any association between complex traits 
and genetic regions considering both common and rare 
variants. 

Weighted Selective Collapsing Strategy 

We now propose a new collapsing strategy, which considers genetic information
from both common and rare variants. The new strategy tries to remove the noise
generated by the non-causal variants and to improve the power by considering
both deleterious and protective components of the region. In brief, our
strategy is as follows. We define rare variants as SNPs with minor allele
frequencies less than 0.01, and all others as common variants. Starting from a
null model without any variants, common SNPs are first selectively collapsed
by a forward selection procedure into two components, which serve as bases for
the rare variants. One is a deleterious component having an extremely positive
correlation coefficient with the trait. The other is a protective component
having an extremely negative correlation coefficient. Because rare variants
are expected to have large genetic effects, they are added into the collapsed
set one at a time by a weighted sum function until either no variants remain
or the correlation coefficient no longer improves. The forward selection
procedure is then repeated without the common variants as the basis, which
generates two more components. Last, the collapsed score is obtained from the
four components according to the squared correlation coefficient with the
trait. The test statistic can then be derived from a logistic regression model
between the trait and the collapsed score as before. P-values can be computed
by permutation.

We now describe the procedure in detail. Assume there are J common variants
and K rare variants within a certain predefined genomic region. Let $X^c_j$
and $X^r_k$ denote the genotype vectors across all samples for common and rare
variants, defined by the threshold MAF = 0.01, where $j = 1, \dots, J$ and
$k = 1, \dots, K$. Let $S_+$ denote the deleterious component, which is a
vector collapsed from the subset of the SNPs that achieves an extremely
positive correlation with the trait. Let $S_-$ denote the protective
component, which is a vector collapsed from the subset of the SNPs that
achieves an extremely negative correlation.

Step 1: Forward selection on common SNPs with sum collapsing.

a) Calculate the correlation coefficient R for each common SNP with the trait.
The common SNP with the largest correlation coefficient is added into
$S_+^{new}$, while the common SNP with the lowest correlation coefficient is
added into $S_-^{new}$:

$$S_+^{new} = \arg\max_{T_+ = \text{collapse}(S_+, X^c_j)} \{ \text{Cor}(T_+, Y) : \text{Cor}(T_+, Y) - \text{Cor}(S_+, Y) > 0 \}$$

and

$$S_-^{new} = \arg\min_{T_- = \text{collapse}(S_-, X^c_j)} \{ \text{Cor}(T_-, Y) : \text{Cor}(T_-, Y) - \text{Cor}(S_-, Y) < 0 \}$$

where $\text{collapse}(S_+, X^c_j)$ is the sum of the vectors $S_+$ and
$X^c_j$, for $j = 1, \dots, J$.

b) Update $S_+$ and $S_-$ with $S_+^{new}$ and $S_-^{new}$. Let j take values
only from the remaining common SNPs. Repeat a) until either all common
variants are collapsed into components or there is no improvement in the
correlation coefficient of each component.

Step 2: Forward selection on rare SNPs with weighted sum collapsing.

a) Because rare variants are expected to have large genetic effects, the
data-driven weight is derived as follows to favor the rare variants with a
large genetic effect in either the deleterious or the protective direction:

$$w_k = \frac{K \, p_k}{\sum_{k'=1}^{K} p_{k'}}, \qquad p_k = \left| \frac{\#\{i : Y_i = 1, X^r_{ik} > 0\}}{\#\{i : X^r_{ik} > 0\}} - 0.5 \right|.$$

$X^r_{ik} > 0$ indicates a mutation for the ith sample in the kth rare
variant, and the ratio inside the absolute value is the empirical estimate of
the probability that an individual with the mutation will have the disease.
$w_k$ is adjusted based on $p_k$ with the constraint that the sum of the
weights is the number of rare variants.

b) Calculate the correlation coefficient R for each rare SNP with the trait.
The rare SNP with the largest correlation coefficient is added into
$S_+^{new}$, while the rare SNP with the lowest correlation coefficient is
added into $S_-^{new}$:

$$S_+^{new} = \arg\max_{T_+ = \text{collapse}(S_+, X^r_k)} \{ \text{Cor}(T_+, Y) : \text{Cor}(T_+, Y) - \text{Cor}(S_+, Y) > 0 \}$$

and

$$S_-^{new} = \arg\min_{T_- = \text{collapse}(S_-, X^r_k)} \{ \text{Cor}(T_-, Y) : \text{Cor}(T_-, Y) - \text{Cor}(S_-, Y) < 0 \}$$

where $\text{collapse}(S_+, X^r_k)$ is the sum of the vectors $S_+$ and
$w_k X^r_k$, for $k = 1, \dots, K$.

c) Update $S_+$ and $S_-$ with $S_+^{new}$ and $S_-^{new}$. Let k take values
only from the remaining rare SNPs. Repeat b) until either all rare variants
are collapsed into components or there is no improvement in the correlation
coefficient of each component. The whole procedure generates two collapsed
scores, $S_+^{cr}$ and $S_-^{cr}$, representing the deleterious and protective
components, respectively, for rare variants based on common variants.

Step 3: Construct the final collapsed score. Repeat Step 2 considering rare
variants only, without the bases from common variants. Thus, our test can be
robust when common SNPs are not associated with the trait. This generates
another two components, $S_+^{r}$ and $S_-^{r}$. The final collapsed score is
derived as follows:

$$S_{wsc} = \arg\max_{T \in A} \{ \text{Cor}(T, Y)^2 \}$$

where $A = \{ S_+^{cr}, S_-^{cr}, S_+^{r}, S_-^{r} \}$.

The test statistic (wSC) can be derived from a logistic regression model
between the trait and the collapsed score as before. P-values can be computed
by permutation.

$S_{wsc}$ is constructed by comparing the potential effects of components in
different directions. As an alternative, we also propose a method (wSCd) that
detects the genetic effects and is robust when the effects are in different
directions. To find wSCd, we follow the same steps described above for
deriving wSC, but the final collapsed score is

$$S_{wscd} = \arg\max_{T \in A} \{ \text{Cor}(T, Y)^2 \}$$

where $A = \{ S_+^{cr} - S_-^{cr}, \; S_+^{r} - S_-^{r} \}$.
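Steps 1 to 3 can be sketched compactly as below. This is one possible reading
of the procedure, not the authors' implementation: the correlation of an empty
(all-zero) component is treated as 0, each SNP is assigned to at most one of
the two components within a selection pass, carriers-free variants get weight
input 0, and all function names are hypothetical. The permutation step for the
p-value is omitted.

```python
import numpy as np

def corr(score, y):
    """Sample correlation, defined as 0 when either vector is constant."""
    a, b = score - score.mean(), y - y.mean()
    denom = np.sqrt((a @ a) * (b @ b))
    return 0.0 if denom == 0 else float(a @ b) / denom

def adaptive_rare_weights(x_rare, y):
    """w_k proportional to |P(case | carrier) - 0.5|, scaled to sum to K."""
    carrier = x_rare > 0
    n_carrier = carrier.sum(axis=0)
    n_case_carrier = (carrier & (y[:, None] == 1)).sum(axis=0)
    frac = np.divide(n_case_carrier, n_carrier,
                     out=np.full(x_rare.shape[1], 0.5), where=n_carrier > 0)
    p = np.abs(frac - 0.5)
    return np.ones_like(p) if p.sum() == 0 else len(p) * p / p.sum()

def select_pair(columns, y, s_pos, s_neg):
    """Forward selection of a deleterious (s_pos) and protective (s_neg) component."""
    s_pos, s_neg = s_pos.copy(), s_neg.copy()
    remaining = list(range(columns.shape[1]))
    while remaining:
        improved = False
        cors = [corr(s_pos + columns[:, j], y) for j in remaining]
        top = int(np.argmax(cors))
        if cors[top] > corr(s_pos, y):
            s_pos = s_pos + columns[:, remaining.pop(top)]
            improved = True
        if remaining:
            cors = [corr(s_neg + columns[:, j], y) for j in remaining]
            low = int(np.argmin(cors))
            if cors[low] < corr(s_neg, y):
                s_neg = s_neg + columns[:, remaining.pop(low)]
                improved = True
        if not improved:
            break
    return s_pos, s_neg

def wsc_scores(x_common, x_rare, y):
    """Return the collapsed scores for wSC and wSCd."""
    zero = np.zeros(len(y))
    weighted_rare = x_rare * adaptive_rare_weights(x_rare, y)
    c_pos, c_neg = select_pair(x_common, y, zero, zero)            # Step 1
    cr_pos, cr_neg = select_pair(weighted_rare, y, c_pos, c_neg)   # Step 2
    r_pos, r_neg = select_pair(weighted_rare, y, zero, zero)       # Step 3
    pick = lambda cands: max(cands, key=lambda t: corr(t, y) ** 2)
    s_wsc = pick([cr_pos, cr_neg, r_pos, r_neg])
    s_wscd = pick([cr_pos - cr_neg, r_pos - r_neg])
    return s_wsc, s_wscd
```

Because the weights and the selected subsets depend on the trait, the whole
of `wsc_scores` would be rerun on each permuted trait vector when computing
the P-value.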

Appendix 

In the appendix, we show that the weight defined by
$w_k = 1 / \sqrt{q_k (1 - q_k)}$ [12] tends to favor deleterious rare variants
and ignore protective rare variants. Instead of using estimated minor allele
frequencies, let q be the minor allele frequency in controls, and let p be the
minor allele frequency in cases. Then $w = 1 / \sqrt{q (1 - q)}$, and $w_k$
should have similar behavior.

By its definition, w is a decreasing function of q, where $q \in (0, 0.5)$.
Let R denote the odds ratio of the case and control groups and r be the minor
allele frequency in all samples for a given SNP. We have

$$R = \frac{p}{1 - p} \Big/ \frac{q}{1 - q}$$

and

$$r = \frac{N_{case} \, p + N_{control} \, q}{N_{case} + N_{control}},$$

where $N_{case}$ and $N_{control}$ are the numbers of samples in cases and
controls, respectively. The above equation can be written as

$$q = r - \frac{N_{case} (p - q)}{N_{case} + N_{control}}.$$

The relationship between p and q can be easily derived based on the value of R
as follows.

If $R > 1$, then $\frac{p}{1 - p} \Big/ \frac{q}{1 - q} > 1 \Rightarrow p > q \Rightarrow q < r$.

If $R < 1$, then $\frac{p}{1 - p} \Big/ \frac{q}{1 - q} < 1 \Rightarrow p < q \Rightarrow q > r$.

Let $w_0 = 1 / \sqrt{r (1 - r)}$, which is the weight for any non-causal
variant (R = 1). If rare variants have a deleterious genetic effect, then
$R > 1$ and $w > w_0$. If rare variants potentially have a protective genetic
effect for the disease, then $R < 1$ and $w < w_0$. This shows that the weight
defined by $w_k = 1 / \sqrt{q_k (1 - q_k)}$ [12] tends to favor deleterious
rare variants and ignore protective rare variants.
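The argument can be checked numerically. The sketch below solves the two
relations for the control frequency q given the pooled frequency r, the odds
ratio R, and the group sizes, then compares the resulting weight with the
non-causal weight. The bisection solver, the function names, and the example
values (r = 0.005, 500 cases, 500 controls) are assumptions chosen only for
illustration.

```python
import math

def control_maf(r, odds_ratio, n_case, n_control, tol=1e-12):
    """Solve for q given the pooled frequency r and odds ratio R by bisection."""
    def pooled(q):
        p = odds_ratio * q / (1.0 - q + odds_ratio * q)   # case frequency from R and q
        return (n_case * p + n_control * q) / (n_case + n_control)
    low, high = 0.0, 1.0
    while high - low > tol:                                # pooled(q) increases with q
        mid = (low + high) / 2.0
        if pooled(mid) < r:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0

def weight(q):
    """w = 1 / sqrt(q (1 - q)), decreasing in q on (0, 0.5)."""
    return 1.0 / math.sqrt(q * (1.0 - q))

r = 0.005
w0 = weight(r)                                             # non-causal variant, R = 1
w_deleterious = weight(control_maf(r, 2.0, 500, 500))      # R > 1 gives q < r
w_protective = weight(control_maf(r, 0.5, 500, 500))       # R < 1 gives q > r
```

In this example `w_deleterious` exceeds `w0` and `w_protective` falls below
it, in line with the derivation above.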